The purpose of using this data set is to build models that predict student academic success. However, a significant challenge associated with this endeavor is the strong class imbalance in the data, as described by the authors in “Early Prediction of student’s Performance in Higher Education: A Case Study.” The goal of this report is to use exploratory data analysis to examine the features closely correlated with student academic performance, as measured by first- and second-semester grade average. Understanding these relationships may help inform the development of machine learning models that more accurately predict academic success while mitigating bias.
This dataset contains features provided by a higher education institution in Portugal and collected from various disjoint databases about students at the time of their enrollment in an undergraduate degree program. The purpose of the dataset is to predict student dropout and academic success. The features relate to student demographics, socio-economic indicators, degree program, and academic performance at the end of the first and second semesters. The collected data previously underwent cleaning.
The dataset contains 4,424 observations and 34 feature variables.
# Load the packages used throughout this report
library(dplyr)      # recode(), mutate(), case_when()
library(tidyr)      # pivot_longer()
library(ggplot2)    # pie charts and bar plots
library(gridExtra)  # grid.arrange()
library(plotly)     # ggplotly()
# Set the working directory (machine-specific path) and load the data
setwd("~/Desktop/Education/WCU Coursework/Spring 2025/STA552/proj/1/1.1")
data <- read.csv(file = "student-success.csv")
# Remove the last feature ("Target" does not include any values)
data <- data[, -ncol(data)]
# Fix the random seed (an arbitrary value) so the injected missingness is reproducible
set.seed(552)
# Draw random observation IDs and set roughly 8% of each grade column to missing
grades1.missing.id <- sample(1:4424, 353, replace = FALSE)
grades2.missing.id <- sample(1:4424, 354, replace = FALSE)
data$Curricular.units.1st.sem..grade.[grades1.missing.id] <- NA
data$Curricular.units.2nd.sem..grade.[grades2.missing.id] <- NA
# Total the number of missing values per column
missing_per_column <- colSums(is.na(data))
# Total the number of missing values in the dataset
total_missing <- sum(is.na(data))
# Original column names
colnames(data) <- c(
"Marital.status", "Application.mode", "Application.order", "Course",
"Daytime.evening.attendance", "Previous.qualification", "Nacionality",
"Mother.s.qualification", "Father.s.qualification", "Mother.s.occupation",
"Father.s.occupation", "Displaced", "Educational.special.needs", "Debtor",
"Tuition.fees.up.to.date", "Gender", "Scholarship.holder", "Age.at.enrollment",
"International", "Curricular.units.1st.sem..credited.", "Curricular.units.1st.sem..enrolled.",
"Curricular.units.1st.sem..evaluations.", "Curricular.units.1st.sem..approved.",
"Curricular.units.1st.sem..grade.", "Curricular.units.1st.sem..without.evaluations.",
"Curricular.units.2nd.sem..credited.", "Curricular.units.2nd.sem..enrolled.",
"Curricular.units.2nd.sem..evaluations.", "Curricular.units.2nd.sem..approved.",
"Curricular.units.2nd.sem..grade.", "Curricular.units.2nd.sem..without.evaluations.",
"Unemployment.rate", "Inflation.rate", "GDP"
)
# Function to clean and reformat column names
clean_column_names <- function(names) {
  # Replace ".s." with "'s " (special case for possessives)
  names <- gsub("\\.s\\.", "'s ", names)
  # Replace the remaining periods with spaces, then collapse the repeated
  # spaces and trim the trailing whitespace left by ".." and "."
  names <- gsub("\\.", " ", names)
  names <- trimws(gsub("\\s+", " ", names))
  # Convert to title case (capitalize the first letter of each word)
  names <- tools::toTitleCase(names)
  # Correct the misspelling "Nacionality" to "Nationality"
  names[names == "Nacionality"] <- "Nationality"
  # Keep "GDP" in all caps
  names[names == "Gdp"] <- "GDP"
  # Expand the abbreviation "Sem" to "Semester"
  names <- gsub("Sem ", "Semester ", names, fixed = TRUE)
  return(names)
}
# Apply the function to the column names of the dataframe
colnames(data) <- clean_column_names(colnames(data))
# Change the column names for 1st and 2nd semester Grades
colnames(data)[c(24, 30)] <- c("Grade1", "Grade2")
This section will describe the univariate distributions of the key categorical, dichotomous, and continuous features in the dataset and analyze any potential issues for machine learning applications.
There are three key groups that are underrepresented within this dataset and should be carefully considered in the development of machine learning models so as not to reproduce bias. Section 3 of this report explores in more detail the relationship between these groups and academic achievement in the dataset.
# Recode binary values from numbers to categories
data$Debtor <- recode(data$Debtor, `1` = "Debt", `0` = "No Debt")
data$`Tuition Fees Up to Date` <- recode(data$`Tuition Fees Up to Date`, `1` = "Paid", `0` = "Unpaid")
data$Gender <- recode(data$Gender, `1` = "Male", `0` = "Female")
# Count the number of occurrences for each category
debtor_counts <- table(data$Debtor)
tuition_counts <- table(data$`Tuition Fees Up to Date`)
gender_counts <- table(data$Gender)
# Create data frames from the counts for each category
debtor_df <- as.data.frame(debtor_counts)
tuition_df <- as.data.frame(tuition_counts)
gender_df <- as.data.frame(gender_counts)
# Create the pie charts with ggplot2
# Debtor pie chart
pie_debtor <- ggplot(debtor_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") + # Convert the bar chart to a pie chart
  scale_fill_manual(values = c("#4C57B8", "#AFB1F0")) +
  theme_void() + # Remove background and axes
  guides(fill = guide_legend(title = NULL)) + # Keep the legend, drop its title
  ggtitle("Debtor Status")
# Tuition fees pie chart
pie_tuition <- ggplot(tuition_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") + # Convert the bar chart to a pie chart
  scale_fill_manual(values = c("#F5A1B5", "#ac1f67")) +
  theme_void() + # Remove background and axes
  guides(fill = guide_legend(title = NULL)) + # Keep the legend, drop its title
  ggtitle("Tuition Fees")
# Gender pie chart
pie_gender <- ggplot(gender_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") + # Convert the bar chart to a pie chart
  scale_fill_manual(values = c("#E6DE3C", "#B9A520")) +
  theme_void() + # Remove background and axes
  guides(fill = guide_legend(title = NULL)) + # Keep the legend, drop its title
  ggtitle("Gender")
# Combine the three pie charts into a single row
grid.arrange(pie_debtor, pie_tuition, pie_gender, ncol = 3)
This dataset contains the previous academic qualifications of the student, mother, and father. Among students, 84.02% have completed secondary education. The parental qualification distributions do not mirror the students': each parent group spans a much wider range of educational categories (likely due to the disparate sources of data). When recoded into three broad groups (secondary education, higher education, and other), most of the parent qualifications fall into the "other" category. Further recoding of these qualifications is needed before the relationship between parent qualification and student grades can be examined.
# Create a dataframe with a single column, "Var1", listing all 34 qualification codes
qual <- data.frame(Var1 = 1:34)
# Calculate the count of each category
S_qual <- as.data.frame(table(data$`Previous Qualification`))
M_qual <- as.data.frame(table(data$`Mother's Qualification`))
F_qual <- as.data.frame(table(data$`Father's Qualification`))
# table() returns the codes as factors; coerce them back to integers so the
# merge keys (and the numeric comparisons below) behave correctly
S_qual$Var1 <- as.integer(as.character(S_qual$Var1))
M_qual$Var1 <- as.integer(as.character(M_qual$Var1))
F_qual$Var1 <- as.integer(as.character(F_qual$Var1))
# Merge by the common column "Var1"
qual <- merge(qual, S_qual, by = "Var1", all = TRUE)
qual <- merge(qual, M_qual, by = "Var1", all = TRUE)
qual <- merge(qual, F_qual, by = "Var1", all = TRUE)
colnames(qual) <- c("Qualification", "Student", "Mother", "Father")
# Replace NAs with 0
qual[is.na(qual)] <- 0
# Pivot the data to long format
qual_long <- qual %>%
  pivot_longer(cols = c(Student, Mother, Father),
               names_to = "Role",
               values_to = "Count")
# Keep only the Mother and Father roles
qual_long <- qual_long %>%
  filter(Role %in% c("Mother", "Father"))
# Group the qualification codes into three broad categories
qual_long <- qual_long %>%
  mutate(Qualification = case_when(
    Qualification == 1 ~ "Secondary Education",
    Qualification >= 2 & Qualification <= 6 ~ "Higher Education Degree",
    Qualification >= 7 & Qualification <= 34 ~ "Other",
    TRUE ~ as.character(Qualification) # Catch any NA or unrecognized codes
  ))
# Aggregate by Qualification and Role to get total counts
qual_long_aggregated <- qual_long %>%
  group_by(Qualification, Role) %>%
  summarise(TotalCount = sum(Count), .groups = "drop")
# Create the ggplot object for the stacked bar plot
p <- ggplot(qual_long_aggregated,
            aes(x = Qualification, y = TotalCount, fill = Role,
                text = paste("Total Count: ", TotalCount))) +
  geom_bar(stat = "identity") + # stat = "identity" plots the precomputed counts
  labs(x = "Qualification", y = "Count", fill = "Role") +
  ggtitle("Distribution of Parent Qualifications (Grouped by Education Level)") +
  scale_fill_manual(values = c("Mother" = "#FFC20A", "Father" = "#0C7BDC")) + # Custom colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Tilt x labels for readability
# Convert the ggplot object to a plotly object for interactivity
interactive_plot <- ggplotly(p, tooltip = "text")
# Show the interactive plot
interactive_plot
The grade features for each semester contain both missing values and values of zero. When grades of zero are excluded, the grade distributions are approximately normal. Imputation could replace the missing values, while an additional binary feature could be created to retain the information that a student received a grade of zero in a given semester.
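Both options can be sketched in a few lines. The snippet below is a minimal illustration on simulated grades; the `grades`, `zero_flag`, and `imputed` names are illustrative, not part of the report's pipeline:

```r
# Minimal sketch of the two strategies on simulated grades:
# (1) a binary flag that preserves the zero-grade information,
# (2) median imputation of the missing values from the non-zero grades.
set.seed(42)
grades <- c(rnorm(20, mean = 12, sd = 2), 0, 0, NA, NA)

# Option 1: indicator for a grade of exactly zero (NA stays 0 here)
zero_flag <- as.integer(!is.na(grades) & grades == 0)
sum(zero_flag) # 2

# Option 2: impute NAs with the median of the observed non-zero grades
med <- median(grades[!is.na(grades) & grades > 0])
imputed <- ifelse(is.na(grades), med, grades)
sum(is.na(imputed)) # 0
```

In the real data, the same pattern would apply to the Grade1 and Grade2 columns, with the flag created before any zeros are removed or imputed.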
This section examines the relationships between key categorical and numerical features in the dataset.
First- and second-semester grade averages appear to have a positive linear relationship. Based on a simple linear regression of second-semester on first-semester grades, each one-point increase in first-semester grade average is associated with an approximately 0.7-point increase in second-semester grade average.
# Remove rows where Grade1 or Grade2 is missing or zero (an NA in the
# logical index would otherwise introduce all-NA rows)
complete <- !is.na(data$Grade1) & !is.na(data$Grade2)
filtered_data <- data[complete & data$Grade1 != 0 & data$Grade2 != 0, ]
# Fit a linear regression model of Grade2 on Grade1
model <- lm(Grade2 ~ Grade1, data = filtered_data)
# Scatterplot of first-semester and second-semester average grades
plot(filtered_data$Grade1, filtered_data$Grade2,
     col = "#FFC20A",
     pch = 19,
     main = "Average Grades",
     xlab = "First Semester",
     ylab = "Second Semester")
# Add the regression line
abline(model, col = "#ac1f67", lwd = 2) # Regression line with width 2
# Extract the slope (the coefficient of Grade1)
slope <- coef(model)[2]
A minority of students have debt or unpaid tuition fees. In the figures below, the distribution of average grades is stratified by debt status and by tuition payment status. Students in the minority group of each category appear to have a lower grade distribution, and at the 0.05 significance level there is sufficient evidence to conclude that mean average grades differ between the majority and minority groups. Subsequent analytic tasks will include removing or imputing grade values of zero and then re-examining these relationships.
col0 <- c("#FFC20A", "#0C7BDC")
# Set up the layout for the boxplots
par(mfrow = c(1, 2)) # 1 row, 2 columns
# Average grade by debtor status, one panel per semester
boxplot(data$Grade1 ~ data$Debtor,
        col = col0,
        main = "First Semester",
        xlab = " ",
        ylab = "Average Grade",
        cex.axis = 0.9, # Smaller font size for axis labels
        cex.lab = 0.9)  # Smaller font size for axis titles
boxplot(data$Grade2 ~ data$Debtor,
        col = col0,
        main = "Second Semester",
        xlab = " ",
        ylab = " ",
        cex.axis = 0.9,
        cex.lab = 0.9)
col1 <- c("#0C7BDC", "#FFC20A")
# Set up the layout for the boxplots
par(mfrow = c(1, 2)) # 1 row, 2 columns
# Average grade by tuition payment status, one panel per semester
boxplot(data$Grade1 ~ data$`Tuition Fees Up to Date`,
        col = col1,
        main = "First Semester",
        xlab = " ",
        ylab = "Average Grade",
        cex.axis = 0.9, # Smaller font size for axis labels
        cex.lab = 0.9)  # Smaller font size for axis titles
boxplot(data$Grade2 ~ data$`Tuition Fees Up to Date`,
        col = col1,
        main = "Second Semester",
        xlab = " ",
        ylab = " ",
        cex.axis = 0.9,
        cex.lab = 0.9)
# Welch two-sample t-tests: mean grade by debtor status
invisible(t.test(Grade1 ~ Debtor, data = data))
invisible(t.test(Grade2 ~ Debtor, data = data))
# Welch two-sample t-tests: mean grade by tuition payment status
invisible(t.test(Grade1 ~ `Tuition Fees Up to Date`, data = data))
invisible(t.test(Grade2 ~ `Tuition Fees Up to Date`, data = data))
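As a sketch of the planned re-check, zero grades can be dropped before re-running the Welch t-test. The snippet below uses simulated stand-in data; the `df` and `nonzero` names, group sizes, and group means are all illustrative rather than values from the dataset:

```r
# Simulated stand-in for the Grade1/Debtor columns: a larger "No Debt"
# group with higher mean grades, plus a handful of zero grades
set.seed(1)
df <- data.frame(
  Grade1 = c(rnorm(200, mean = 13, sd = 1.5),
             rnorm(60, mean = 11.5, sd = 1.5),
             rep(0, 15)),
  Debtor = c(rep("No Debt", 200), rep("Debt", 60), rep("No Debt", 15))
)
# Drop the zero grades, then re-run the Welch two-sample t-test
nonzero <- df[df$Grade1 > 0, ]
tt <- t.test(Grade1 ~ Debtor, data = nonzero)
tt$p.value < 0.05 # the simulated group difference remains significant
```

The same filter-then-test pattern would be applied to the real Grade1/Grade2 columns for each grouping variable.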
Male students, who account for 35.17% of the dataset, appear to achieve a lower average grade distribution. At the 0.05 significance level, there is sufficient evidence of a difference between this group's mean average grade and that of female students. Subsequent analytic tasks will include removing or imputing grade values of zero and then re-examining the relationship.
col1 <- c("#0C7BDC", "#FFC20A")
# Set up the layout for the boxplots
par(mfrow = c(1, 2)) # 1 row, 2 columns
# Average grade by gender, one panel per semester
boxplot(data$Grade1 ~ data$Gender,
        col = col1,
        main = "First Semester",
        xlab = " ",
        ylab = "Average Grade",
        cex.axis = 0.9, # Smaller font size for axis labels
        cex.lab = 0.9)  # Smaller font size for axis titles
boxplot(data$Grade2 ~ data$Gender,
        col = col1,
        main = "Second Semester",
        xlab = " ",
        ylab = " ",
        cex.axis = 0.9,
        cex.lab = 0.9)
# Welch two-sample t-tests: mean grade by gender
invisible(t.test(Grade1 ~ Gender, data = data))
invisible(t.test(Grade2 ~ Gender, data = data))
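Because these t-tests only establish that a difference exists, a standardized effect size such as Cohen's d could complement them when the relationship is re-examined. The helper below is an illustrative sketch on simulated male and female grade samples; the `cohens_d` function and all sample parameters are assumptions, not values from the dataset:

```r
# Pooled-standard-deviation Cohen's d for two independent samples
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp # standardized mean difference
}
# Simulated grade samples with a modest gap in favor of female students
set.seed(7)
female <- rnorm(280, mean = 13.0, sd = 1.5)
male   <- rnorm(150, mean = 12.3, sd = 1.6)
d <- cohens_d(female, male)
round(d, 2)
```

On the real data, the two vectors would be the non-zero Grade1 (or Grade2) values split by the recoded Gender column.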
PUBLICATION: Martins, M.V., Tolledo, D., Machado, J., Baptista, L.M.T., Realinho, V. (2021). Early Prediction of student’s Performance in Higher Education: A Case Study. In: Rocha, Á., Adeli, H., Dzemyda, G., Moreira, F., Ramalho Correia, A.M. (eds) Trends and Applications in Information Systems and Technologies. WorldCIST 2021. Advances in Intelligent Systems and Computing, vol 1365. Springer, Cham. https://doi.org/10.1007/978-3-030-72657-7_16
DATA: M. V. Martins, D. Tolledo, J. Machado, L. M. T. Baptista, V. Realinho. (2021). Predict Students’ Dropout and Academic Success. Kaggle. Attribution 4.0 International (CC BY 4.0). https://www.kaggle.com/datasets/harshitsrivastava25/predict-students-dropout-and-academic-success/data